---
title: "Reference: .chunk() | Document Processing | RAG | Kastrax Docs"
description: Documentation for the chunk function in Kastrax, which splits documents into smaller segments using various strategies.
---

# Reference: .chunk()

The `.chunk()` function splits documents into smaller segments using various strategies and options.

## Example

```typescript
import { MDocument } from '@kastrax/rag';

const doc = MDocument.fromMarkdown(`
# Introduction
This is a sample document that we want to split into chunks.

## Section 1
Here is the first section with some content.

## Section 2
Here is another section with different content.
`);

// Basic chunking with defaults
const chunks = await doc.chunk();

// Markdown-specific chunking with header extraction
const chunksWithMetadata = await doc.chunk({
  strategy: 'markdown',
  headers: [['#', 'title'], ['##', 'section']],
  extract: {
    summary: true, // Extract summaries with default settings
    keywords: true  // Extract keywords with default settings
  }
});
```

## Parameters

<PropertiesTable
  content={[
    {
      name: "strategy",
      type: "'recursive' | 'character' | 'token' | 'markdown' | 'html' | 'json' | 'latex'",
      isOptional: true,
      description:
        "The chunking strategy to use. If not specified, the default is chosen from the document type: .md files → 'markdown', .html/.htm → 'html', .json → 'json', .tex → 'latex', others → 'recursive'. Some strategies accept additional options, listed under Strategy-Specific Options.",
    },
    {
      name: "size",
      type: "number",
      isOptional: true,
      defaultValue: "512",
      description: "Maximum size of each chunk, in characters or tokens depending on the strategy.",
    },
    {
      name: "overlap",
      type: "number",
      isOptional: true,
      defaultValue: "50",
      description: "Number of characters/tokens that overlap between chunks.",
    },
    {
      name: "separator",
      type: "string",
      isOptional: true,
      defaultValue: "\\n\\n",
      description: "Character(s) to split on. Defaults to double newline for text content.",
    },
    {
      name: "isSeparatorRegex",
      type: "boolean",
      isOptional: true,
      defaultValue: "false",
      description: "Whether the separator is a regex pattern",
    },
    {
      name: "keepSeparator",
      type: "'start' | 'end'",
      isOptional: true,
      description:
        "Whether to keep the separator at the start or end of chunks",
    },
    {
      name: "extract",
      type: "ExtractParams",
      isOptional: true,
      description: "Metadata extraction configuration. See [ExtractParams reference](./extract-params) for details.",
    },
  ]}
/>
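
To make the relationship between `size` and `overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is not Kastrax's implementation: `slidingWindowChunks` is a hypothetical helper written for this illustration, and the real strategies also take separators (and, for the token strategy, token counts) into account.

```typescript
// Illustrative only: a naive character-based splitter, not the Kastrax implementation.
// Each chunk holds at most `size` characters and repeats the last `overlap`
// characters of the previous chunk.
function slidingWindowChunks(text: string, size = 512, overlap = 50): string[] {
  if (size <= 0) throw new Error('size must be positive');
  if (overlap < 0 || overlap >= size) throw new Error('overlap must be in [0, size)');

  const step = size - overlap; // how far the window advances each iteration
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}

// 10 characters, windows of 4 with 1 character shared between neighbours
const pieces = slidingWindowChunks('abcdefghij', 4, 1);
console.log(pieces); // [ 'abcd', 'defg', 'ghij' ]
```

A larger `overlap` repeats more context across neighbouring chunks at the cost of producing more chunks for the same input.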

## Strategy-Specific Options

Strategy-specific options are passed as top-level parameters alongside the strategy parameter. For example:

```typescript showLineNumbers copy
// HTML strategy example
const htmlChunks = await doc.chunk({
  strategy: 'html',
  headers: [['h1', 'title'], ['h2', 'subtitle']], // HTML-specific option
  sections: [['div.content', 'main']], // HTML-specific option
  size: 500 // general option
});

// Markdown strategy example
const markdownChunks = await doc.chunk({
  strategy: 'markdown',
  headers: [['#', 'title'], ['##', 'section']], // Markdown-specific option
  stripHeaders: true, // Markdown-specific option
  overlap: 50 // general option
});

// Token strategy example
const tokenChunks = await doc.chunk({
  strategy: 'token',
  encodingName: 'gpt2', // Token-specific option
  modelName: 'gpt-3.5-turbo', // Token-specific option
  size: 1000 // general option
});
```

The options documented below are passed directly at the top level of the configuration object, not nested within a separate options object.
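
One way to picture this flat layout is as a discriminated union keyed on `strategy`. The sketch below is an assumption for illustration only: the type names (`BaseChunkOptions`, `HtmlChunkOptions`, `TokenChunkOptions`, `SketchChunkParams`) and the `describeParams` helper are hypothetical and are not the types exported by Kastrax. Only two strategies are modelled to keep it short.

```typescript
// Hypothetical types for illustration; the actual Kastrax definitions may differ.
interface BaseChunkOptions {
  size?: number;
  overlap?: number;
}

interface HtmlChunkOptions extends BaseChunkOptions {
  strategy: 'html';
  headers?: Array<[string, string]>;
  sections?: Array<[string, string]>;
}

interface TokenChunkOptions extends BaseChunkOptions {
  strategy: 'token';
  encodingName?: string;
  modelName?: string;
}

// General and strategy-specific options live side by side in one flat object.
type SketchChunkParams = HtmlChunkOptions | TokenChunkOptions;

// Narrowing on `strategy` gives access to the matching strategy-specific keys.
function describeParams(params: SketchChunkParams): string {
  switch (params.strategy) {
    case 'html':
      return `html: ${params.headers?.length ?? 0} header rule(s), size=${params.size ?? 'default'}`;
    case 'token':
      return `token: encoding=${params.encodingName ?? 'default'}, size=${params.size ?? 'default'}`;
  }
}

console.log(describeParams({ strategy: 'html', headers: [['h1', 'title']], size: 500 }));
console.log(describeParams({ strategy: 'token', encodingName: 'gpt2' }));
```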

### HTML

<PropertiesTable
  content={[
    {
      name: "headers",
      type: "Array<[string, string]>",
      description:
        "Array of [selector, metadata key] pairs for header-based splitting",
    },
    {
      name: "sections",
      type: "Array<[string, string]>",
      description:
        "Array of [selector, metadata key] pairs for section-based splitting",
    },
    {
      name: "returnEachLine",
      type: "boolean",
      isOptional: true,
      description: "Whether to return each line as a separate chunk",
    },
  ]}
/>

### Markdown

<PropertiesTable
  content={[
    {
      name: "headers",
      type: "Array<[string, string]>",
      description: "Array of [header level, metadata key] pairs",
    },
    {
      name: "stripHeaders",
      type: "boolean",
      isOptional: true,
      description: "Whether to remove headers from the output",
    },
    {
      name: "returnEachLine",
      type: "boolean",
      isOptional: true,
      description: "Whether to return each line as a separate chunk",
    },
  ]}
/>
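
The `headers` pairs can be read as "when a line starts with this marker, store its text under this metadata key". The following sketch shows that idea in isolation. It is a simplified stand-in, not the Kastrax splitter: `splitByHeaders` and `SketchChunk` are hypothetical names, header lines are always dropped from the chunk text (roughly what `stripHeaders: true` describes), and `size` and `overlap` are ignored.

```typescript
// Illustrative only: a simplified header-based splitter, not the Kastrax implementation.
type HeaderRule = [string, string]; // [header marker, metadata key]

interface SketchChunk {
  text: string;
  metadata: Record<string, string>;
}

function splitByHeaders(markdown: string, headers: HeaderRule[]): SketchChunk[] {
  const chunks: SketchChunk[] = [];
  const metadata: Record<string, string> = {};
  let body: string[] = [];

  // Emit the text collected so far, tagged with the headers currently in effect.
  const flush = () => {
    const text = body.join('\n').trim();
    if (text) chunks.push({ text, metadata: { ...metadata } });
    body = [];
  };

  for (const line of markdown.split('\n')) {
    // Requiring a trailing space keeps '# ' from matching a '## ' line.
    const rule = headers.find(([marker]) => line.startsWith(marker + ' '));
    if (!rule) {
      body.push(line);
      continue;
    }
    flush();
    const [marker, key] = rule;
    // A new header replaces metadata from headers at the same or a deeper level.
    for (const [otherMarker, otherKey] of headers) {
      if (otherMarker.length >= marker.length) delete metadata[otherKey];
    }
    metadata[key] = line.slice(marker.length).trim();
  }
  flush();
  return chunks;
}

const sampleMarkdown = '# Intro\nHello.\n\n## Section 1\nFirst.\n\n## Section 2\nSecond.';
const sections = splitByHeaders(sampleMarkdown, [['#', 'title'], ['##', 'section']]);
console.log(sections.map((chunk) => chunk.metadata));
```

In this sketch the second and third chunks carry both `title` and `section` metadata, while the text directly under the top-level header carries only `title`.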

### Token

<PropertiesTable
  content={[
    {
      name: "encodingName",
      type: "string",
      isOptional: true,
      description: "Name of the token encoding to use",
    },
    {
      name: "modelName",
      type: "string",
      isOptional: true,
      description: "Name of the model for tokenization",
    },
  ]}
/>

### JSON

<PropertiesTable
  content={[
    {
      name: "maxSize",
      type: "number",
      description: "Maximum size of each chunk",
    },
    {
      name: "minSize",
      type: "number",
      isOptional: true,
      description: "Minimum size of each chunk",
    },
    {
      name: "ensureAscii",
      type: "boolean",
      isOptional: true,
      description: "Whether to ensure ASCII encoding",
    },
    {
      name: "convertLists",
      type: "boolean",
      isOptional: true,
      description: "Whether to convert lists in the JSON",
    },
  ]}
/>

## Return Value

Returns an `MDocument` instance containing the chunked documents. Each chunk includes:

```typescript
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}
```
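
As a rough illustration of consuming chunks of this shape, the sketch below groups chunk texts by a metadata key such as the `section` key from the header rules in the example above. The `groupByMetadata` helper is hypothetical and not part of Kastrax, and the interface is redeclared locally so the snippet stands alone.

```typescript
// Local copy of the shape shown above so this sketch is self-contained.
interface DocumentNode {
  text: string;
  metadata: Record<string, any>;
  embedding?: number[];
}

// Hypothetical helper: collect chunk texts under the value of one metadata key.
function groupByMetadata(nodes: DocumentNode[], key: string): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const node of nodes) {
    // Chunks without the key are collected under 'unknown'.
    const value = String(node.metadata[key] ?? 'unknown');
    const bucket = groups.get(value) ?? [];
    bucket.push(node.text);
    groups.set(value, bucket);
  }
  return groups;
}

const exampleNodes: DocumentNode[] = [
  { text: 'Here is the first section.', metadata: { section: 'Section 1' } },
  { text: 'More on the first section.', metadata: { section: 'Section 1' } },
  { text: 'Here is another section.', metadata: { section: 'Section 2' } },
  { text: 'Text without a section header.', metadata: {} },
];

const bySection = groupByMetadata(exampleNodes, 'section');
console.log([...bySection.keys()]); // [ 'Section 1', 'Section 2', 'unknown' ]
```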